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1.  Introduction 


This  document  describes  a  series  of  kernel  benchmarks  for  the  PCA  program.  Each  kernel  bench¬ 
mark  is  an  operation  of  importance  to  DoD  sensor  applications  making  use  of  a  PCA  architecture. 

The  kernel-level  benchmarks  have  been  chosen  to  stress  both  computation  and  communication 
aspects  of  the  architecture.  “Computation”  aspects  include  floating-point  and  integer  performance, 
as  well  as  the  memory  hierarchy,  while  the  “communication”  aspects  include  the  network,  the 
memory  hierarchy,  and  the  I/O  capabilities.  The  particular  benchmarks  chosen  are  based  on  the 
frequency  of  their  use  in  current  and  future  applications.  They  are  drawn  from  the  areas  of  signal 
processing,  communication,  and  information  and  knowledge  processing. 

Source  code  for  most  of  the  kernel-level  benchmarks  is  provided  in  the  MATLAB  program¬ 
ming  language,  with  a  C  and  C-h-  version  to  be  released  in  the  summer  of  2003.  The  specification 
of  the  benchmarks  in  this  document  is  meant  to  be  high-level  and  largely  independent  of  the  imple¬ 
mentation.  MATLAB  code  is  not  provided  for  the  comer-turn  or  database  kernels.  The  comer-tum 
kernel  is  described  in  the  C  language,  as  it  involves  explicit  memory  operation.  The  database 
kernel  is  described  relative  to  a  generic  database  interface. 
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2.  Metrics 


For  each  benchmark,  a  set  of  problem  sizes  are  defined.  Throughout  this  section,  we  refer  to 
the  kernel  by  the  index  k,  and  refer  to  particular  data  sets  for  a  given  kernel  as  di,  where  i  = 
1,2,...,  Nk,  and  Nk  varies  from  kernel  to  kernel.  We  assume  that  file  data  for  the  problem  begins 
in  a  “staging  area”  accessible  to  the  PCA  computation  units  (“main  memory”  or  “an  FO  stream”) 
and  must  be  moved  into  local  memory.  For  the  initial  realization  of  the  benchmarks,  the  staging 
area  can  be  main  memory.  Final  measurements  in  the  next  phase  of  the  PCA  program  should  have 
an  yO  stream  as  the  staging  area. 

There  are  two  major  metrics  of  interest  for  each  problem  size.  The  first  is  the  total  time  or 
latency,  Li  {k,  di),  to  perform  kernel  k  for  a  data  set  size  di  using  a  single  PCA  prototype  integrated 
circuit  (hereafter,  a  PCA  chip).  This  measurement  should  include  both  computation  time  and  the 
time  to  move  the  data  for  the  problem  from  the  staging  area  (off  the  PCA  chip)  to  a  computation 
or  operation  area  (on  the  PCA  chip). 

The  second  major  metric  of  interest  is  the  sustained  achievable  throughput,  T{k,  di).  For  each 
kernel  k  and  problem  size  di,  a  measure  of  the  workload,  W (fc,  di),  is  defined  in  an  operation- 
dependent  and  system-independent  way.  (For  floating-point  computation  operations,  W  is  the 
floating-point  operation  count,  while  for  communication  operations,  W  is  the  number  of  bytes 
transferred.)  The  sustained  achievable  throughput  is 


T{k,  di)  =  lim 

n— >oo 


nW{k,di) 

Ln{k,di)  ’ 


(1) 


where  L„(jt,  di)  is  the  total  time  to  solve  n  problems  of  the  given  type  using  the  PCA  chip.  As 
above,  Ln{k,  di)  includes  the  time  to  move  the  data  from  the  staging  area  to  an  operation  area. 

There  are  clear  trade-offs  between  throughput  and  latency.  If  the  entire  PCA  chip  is  being  used 
to  solve  kernel  k  for  data  set  di,  then  L„  =  nLi  and  T  =  W/Ly.  In  some  cases,  however,  an 
operation  will  be  able  to  take  advantage  of  pipelining  and  perform  multiple  computations  of  the 
same  type  at  the  same  time,  resulting  in  higher  throughput.  Obviously,  the  extent  to  which  this  can 
be  accomplished  will  depend  on  the  input  bandwidth  of  the  PCA  chip.  To  measure  the  throughput 
for  our  purposes,  it  is  sufficient  to  measure  L„  for  a  value  of  n  that  is  sufficiently  large  (at  least 
n  >  10,  and  preferably  n  >  100). 

For  embedded  systems  we  are  interested  in  the  efficiency  of  the  operation,  that  is,  the  use  of 
resources  relative  to  the  potential  of  the  device.  In  general  the  efficiency  E(k,  di)  is  defined  as 


where  U{k)  is  the  kernel-dependent  upper  bound  or  peak  performance  of  the  chip.  The  definition 
of  U (k)  is  linked  to  the  definition  of  the  workload.  When  W  is  in  floating-point  operations,  U (k) 
is  the  theoretical  peak  floating-point  computation  rate  (based  on  the  clock  rate  and  the  number  of 
floating-point  units).  For  a  communication  operation,  where  workload  is  defined  in  bytes,  U{k)  is 
the  theoretical  peak  bandwidth  between  the  communicating  units.  For  benchmarks  other  than  the 
signal  processing  and  communication  benchmarks,  efficiency  is  difficult  to  calculate  because  peak 
performance  for  the  corresponding  workloads  cannot  easily  be  defined.  For  example,  the  workload 


3 


for  the  database  benchmark  is  in  transactions,  and  there  does  not  exist  an  easily  calculable  peak 
performance  for  the  number  of  transactions  performed.  In  these  situations,  efficiency  cannot  be 
calculated. 

One  of  the  key  metrics  for  the  PCA  program  is  stability  of  the  performance.  Kuck  [8,  p.l68ff] 
defines  stability  as  the  ratio  of  the  minimum  achieved  performance  to  the  maximum  achieved 
performance  over  a  set  of  data  set  sizes  and  programs.  Stability  is  defined  in  two  senses  for  the 
kernel  benchmarks,  a  per-kemel  sense  and  an  overall  sense.  The  per-kemel  stability  is  reflected  by 
a  metric  called  data  set  stability,  Sd,  defined  as  the  stability  for  a  particular  kernel  over  all  data  sets 
for  that  kernel, 


c  (1c\  =  tmndiT{k,di) 

maxd,T{k,diy 


(3) 


Stability  across  all  kernels  poses  a  problem,  as  the  workloads  and  thus  the  throughput  calculations 
are  different  for  different  kernels.  However,  a  good  indication  of  the  overall  stability  can  be  gleaned 
from  the  geometric  mean  of  the  kernel  stabilities, 


Sd  = 


\ 


*=1 


(4) 


Finally,  for  embedded  systems,  an  important  metric  is  the  achieved  performance  per  unit  power 
consumed  by  the  PCA  chip. 


C{kA) 


T{k,di) 

P{k,diy 


(5) 


where  P{k,  di)  is  the  overall  power  consumed  during  the  operation.  This  normalized  quantity  C 
gives  some  indication  of  the  “cost”  of  executing  the  benchmark  on  the  given  PCA  chip.  Obviously, 
this  metric  ignores  power  consumed  by  other  elements  of  the  system,  but  allows  comparison  with 
commercially  available  processors  using  the  same  metric.  Performance  metrics  per  unit  size  and 
weight  are  omitted,  as  the  processing  unit  is  perceived  to  be  less  of  a  driver  for  either  of  these 
quantities  than  for  power  consumed  by  the  system. 

Along  with  the  specified  measurements,  developers  are  asked  to  provide  their  implementa¬ 
tion  of  the  algorithm  and  a  description  of  chip  resource  usage.  The  specific  implementation  of 
the  benchmark  used  to  achieve  the  measured  latency  and  throughput  is  of  great  interest  for  several 
reasons.  One  important  reason  is  to  compare  the  throughput  and  latency  achievable  by  different  ap¬ 
proaches.  For  this  comparison,  the  availability  of  multiple  implementations  of  the  same  algorithm 
on  the  same  architecture  using  different  approaches  would  be  ideal. 

Another  important  reason  to  examine  the  implementation  has  to  do  with  the  workload,  W {k,  di), 
which  is  defined  for  a  particular  implementation  of  the  kernel.  This  standard  workload  and  imple¬ 
mentation  allows  comparison  of  different  architectures.  If  an  additional  algorithm  with  a  signifi¬ 
cantly  diflFerent  workload  is  also  implemented,  the  value  of  W  must  be  adjusted  or  the  workload 
is  invalid.  Therefore,  developers  are  asked  to  provide  an  estimate  of  the  workload  for  implementa¬ 
tions  that  are  substantially  different  from  those  given  in  this  report  and  its  associated  code. 

In  summary,  developers  are  requested  to  measure  the  latency,  throughput,  and  power  consumed 
for  each  kernel  benchmark  k  and  data  set  di.  The  theoretical  peak  floating-point,  operation,  and 
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communication  rates  should  be  reported  for  the  chip.*  All  other  metrics  (efficiency,  stability  over 
problem  size,  stability  over  all  kernels,  and  performance  per  unit  power)  are  derived  from  chip 
parameters  and  the  measured  quantities.  Other  statistics  such  as  variance  may  be  appropriate  also 
and  may  be  calculated  from  these  results.  The  desired  quantities  are  summarized  in  Table  1. 


Table  1. 

Benchmark  metrics. 


Parameter 

Description 

Calculation 

Total  latency  for  j  problems 

measured 

Wik,di) 

Workload 

given 

Tik,di) 

Throughput 

U{k) 

Performance  upper  bound  for  operation  type: 

floating-point 

clock  rate  *  floating-point  units 

communication 

bandwidth  between 
communicating  units 

integer 

clock  rate  *  integer  units 

E{k^  dj) 

Efficiency 

Stability  over  data  sets 

Si 

Mean  of  data  set  stabilities 

^ULtSaik) 

P{k,di) 

Power  consumed 

measured 

C{k,di) 

Performance-power  efficiency 

'These  rates  should  be  specific  to  an  agreed-upon  technology  as  discussed  at  the  Morphware  forum  meeting  in  July 
2002. 


5 


3.  Signal  Processing  Benchmarks 


Three  signal  processing  kernels  are  included  in  the  benchmark  set.  Each  presents  a  different  set  of 
characteristics  in  terms  of  operation  counts  and  memory  references. 

3.1  Finite  Impulse  Response  Filter  Bank 

The  finite  impulse  response  (FIR)  filter  is  one  of  the  basic  operations  in  signal  processing.  This 
kernel  implements  a  set  of  M  FIR  filters.  Each  FIR  filter  m,  m  G  {0, 1, . . . ,  M  —  1},  has  a  set  of 
impulse  response  coefficients  A:  G  {0 ...  —  1}.  If  the  length  of  the  input  vector  is  N,  the 

output  of  filter  m,  ym,  is  the  convolution  of  Wm  with  the  input  x^: 

K-l 

ym[i]  =  for  i  =  0, 1, ...  AT  -  1. 

fc=0 

The  filter  is  often  implemented  using  fast  convolution  with  the  fast  Fourier  transform  (FFT). 
The  most  efficient  implementation  depends  on  various  factors  including  the  size  of  the  filter  re¬ 
sponse  vector  (for  more  details,  see  [9]).  We  define  two  data  sets  in  Table  2,  one  for  a  short  kernel 
and  one  for  a  long  kernel.  In  the  provided  code,  each  filter  operates  on  one  row  of  the  input  data 
matrix.  Time-domain  and  frequency-domain  implementations  are  given. 


Table  2. 

FIR  filter  bank  input  parameters. 


Parameter 

Name 

Description 

Values 

Setl 

Set  2 

M 

Number  of  filters 

64 

20 

N 

Length  of  input  and  ouq)ut  vectors 

4096 

1024 

K 

Number  of  filter  coefficients 

128 

12 

W 

Workload  (Mflop) 

33 

1.97 

3.2  Singular  Value  Decomposition 

The  singular  value  decomposition  (SVD)  is  of  increasing  importance  in  signal  processing.  It 
is  an  advanced  linear  algebra  operation  that  produces  a  basis  for  the  row  and  column  space  of  the 
matrix  and  an  indication  of  the  rank  of  the  matrix.  In  adaptive  signal  processing,  the  matrix  rank 
and  the  basis  are  useful  for  reducing  the  effects  of  interference. 

Given  an  m  x  n  complex  matrix  A,  the  singular  value  decomposition  of  A  is 

A  =  UEV^,  (6) 

where  (/  is  a  unitary  matrix  of  size  m  x  m,  S  is  an  m  x  n  matrix  in  which  the  upper  nxn  portion 
is  a  diagonal  matrix  with  all  entries  real  and  sorted  in  descending  order,  and  V  is  an  n  x  n  unitary 
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Table  3. 


SVD  input  parameters. 


Parameter 

Name 

Description 

Values 

mfsn 

igg 

m 

Matrix  rows 

■ami 

MKil 

n 

Matrix  columns 

60 

Full  SVD  Algorithm 

Wn 

Fixed  workload  (Mflop) 

424 

36 

72 

Wf^ 

Workload  per  iteration  (Mflop) 

0.24 

0.088 

0.54 

Reduced  SVD  Algorithm 

Wrl 

Fixed  workload  (Mflop) 

101 

15 

72 

Wr2 

Workload  per  iteration  (Mflop) 

0.24 

0.88 

0.54 

matrix.  lfm>n,  then  define 


U  =  [C/a  Ub], 


where  is  size  m  x  n,  C4  is  size  m  x  (m  —  n),  and  Ea  is  of  size  n  x  n.  Then  A  =  Ua^aV^ 
is  called  the  reduced  SVD  of  the  matrix  A.  In  this  context  the  SVD  defined  in  equation  (6)  is 
sometimes  referred  to  as  ihejull  SVD  for  contrast.  Notice  that  Ua  is  not  unitary,  but  it  does  have 
orthogonal  columns.  When  m  <  the  reduced  SVD  can  be  similarly  defined  by  partitioning  V 
instead  of  U. 

For  signal  processing  applications,  we  are  typically  most  interested  in  the  reduced  decompo¬ 
sition,  in  the  matrix  U,  and  in  the  singular  values  (the  values  on  the  diagonal  of  S).  However, 
the  kernel  benchmark  code  produces  all  three  matrices,  and  can  produce  either  the  full  or  reduced 
decomposition  on  request.  The  data  matrix  sizes  of  interest  are  given  in  Table  3. 

There  are  three  major  steps  to  the  full  SVD  algorithm,  which  are  described  in  more  detail  in 
Golub  and  Van  Loan  [7].  First,  if  the  m  x  n  matrix  A  has  many  more  rows  than  columns,  a  QR 
factorization  is  performed.  This  step  is  done  if  m  >  5n/3  [7,  p.  252],  which  is  typically  the  case 
in  signal  processing. 

The  QR  factorization  of  an  m  x  n  matrix  A  with  m  >  n  is  a  pair  of  matrices  A  =  QR,  where 
the  unitary  matrix  Q  is  of  size  m  x  m  and  the  upper-triangular  matrix  R  is  of  size  mxn.  The 
usual  way  of  calculating  the  QR  factorization,  as  discussed  in  Golub  and  Van  Loan,  is  by  a  series 
of  Householder  transformations  [7,  Algorithm  5.2.1].  Define 

Q  =  [Qa  Qb], 


where  Qa  is  size  mxn,Qb  is  size  m x  (m—n),  and  Ra  is  size  n x n.  The  decomposition  A  =  QaRa 
is  referred  to  as  the  reduced  QR  decomposition  of  A.  Matrix  Qa  is  not  unitary,  but  it  has  orthogonal 


(7) 

(8) 


8 


columns.  The  reduced  QR  factorization  can  be  obtained  by  the  modified  Gram-Schmidt  algorithm 
described  in  Golub  and  Van  Loan  [7,  Algorithm  5.2.5].  If  a  full  SVD  is  being  performed,  the 
full  QR  is  computed:  if  a  reduced  SVD  is  being  performed,  a  reduced  QR  is  computed.  In  the 
remainder  of  this  exposition,  we  describe  the  full  SVD  algorithm.  We  assume  that  in  the  first  step, 
we  perform  a  full  QR  decomposition  to  produce 

A  =  UiR, 

where  i?  is  an  m  x  n  upper-triangular  matrix  and  is  an  m  x  m  orthogonal  matrix. 

In  the  next  step,  R  is  reduced  to  bi-diagonal  form,  to  consist  of  the  main  diagonal  and  a  single 
diagonal  of  entries  above  that,  with  the  remainder  of  the  entries  in  the  matrix  being  zero  [7,  p.  253]. 
This  is  accomplished  with  Householder  transforms,  producing 

R  =  U2BVf, 

where  U2  and  V2  are  unitary  and  the  m  x  n  matrix  B  is  bi-diagonal.  The  matrix  B  produced  at  this 
step  is  real  (no  longer  complex)  though  matrices  U2  and  V2  are  complex. 

The  final  step  is  an  iterative  reduction  of  B  to  diagonal  form  and  the  ordering  of  the  singular 
values.  This  is  accomplished  with  Givens  rotations  [7,  p.  454].  At  the  end  of  this  step  we  have 
produced  matrices  t/3  and  V3  such  that 

B  = 

so  that  the  singular  value  decomposition  of  the  original  matrix  A  can  be  expressed  as 

A  =  UiU2Uz'£.{VzV2f 
=  C/SV^, 


with  U  =  U1U2UZ  and  V  =  V3V2. 

As  the  exact  number  of  iterations  required  to  produce  the  SVD  is  based  on  the  data,  an  effi¬ 
ciency  measurement  must  take  into  account  the  actual  number  of  iteration  steps  performed.  We 
account  for  this  by  defining  the  workload  VF  as  a  linear  function  of  two  numbers  Wi  and  W2  given 
in  Table  3, 

W-Wi+n*  W2,  (9) 

where  n  is  the  number  of  iteration  steps  performed  in  the  reduction  of  B  to  diagonal  form,  W2  is 
an  estimate  of  the  number  of  floating-point  operations  per  iteration  step,  and  Wi  is  an  estimate  of 
the  number  of  floating-point  operations  in  the  remainder  of  the  algorithm.  The  estimates  Wi  and 
W2  for  complex  matrices  are  given  separately  for  the  full  and  reduced  SVD  algorithms  in  Table  3. 
The  number  n  for  a  given  data  set  must  be  “discovered”  in  the  course  of  the  execution  of  the 
benchmark. 

There  are  many  implementation  details  associated  with  achieving  high  performance  in  the  sin¬ 
gular  value  decomposition.  Examples  of  such  details  include  the  use  of  block  Householder  trans¬ 
forms  [7,  p.  213]  and  the  storage  of  the  Householder  transforms  and  Givens  rotations  that  produce 
U  and  V  rather  than  the  matrices  themselves  [3].  These  methods  are  not  demonstrated  by  the 
kernel  benchmark  code. 
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33  Constant  False  Alarm  Rate  Detection 

The  constant  false-alarm  rate  (CFAR)  detection  benchmark  is  an  example  of  data-dependent 
processing  designedUo  find  targets  in  an  environment  of  varying  background  noise.  The  benchmark 
subjects  a  subset  of  a  radar  data  cube  to  this  algorithm. 

Assume  a  data  cuibe  whose  dimensions  are  beams,  range,  and  dopplers.  During  CFAR  detec¬ 
tion,  a  local  noise  esfimate  is  computed  from  the  2Ncfar  range  gates  near  the  cell  k)  under 
test.  A  number  of  guard  gates  G  immediately  next  to  the  cell  under  test  will  not  be  included  in  the 
local  noise  estimate  (tfhis  number  does  not  affect  the  throughput).  For  each  cell  C {i,  j,  k),  the  value 
of  the  noise  estimate  T{i,  j,  k)  is  calculated  as 


G’hNcfar 


The  range  cells  involved  in  calculating  the  noise  estimate  for  a  particular  vector  are  shown  in 
Figure  1 .  For  each  cd!  C{i,  j,  k),  the  quantity  \C{i,  j,  k) \^/T{i,  j,  k)  is  calculated:  this  represents 
the  normalized  power  in  the  cell  under  test.  If  this  normalized  power  exceeds  a  threshold  /i,  the 
cell  is  considered  to  contain  a  target. 


Cell  Under  Test 


C(i^ 


hk) 


Figure  7.  Sliding  window  in  CFAR  detection.  The  example  shows  the  number  of  guard 
cells  G  =  1  and^e  number  of  cells  used  in  computing  the  estimate  N^far  —  3. 


An  efficient  implementation  of  the  CFAR  algorithm  makes  use  of  the  redundancy  in  the  com¬ 
putation  of  T  accorffing  to  the  formula  given  in  (10).  Note  that  the  relationship  between  T{i,  j,  k) 
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andT(i,j  + 1,/;)  is 

T{i,j  +  l,k)=T{i,j,k)  +  ^^{\C{i,j  +  l  +  G  +  Ncfar,k)\‘^ 

^^cfar 

+  \C{i,j-G,k)\^ 

-  \C{i,j-G-N,far,k)\‘^ 

-  \C{i,j  +  G+l,k)f). 

Using  this  relationship,  the  value  of  T  for  a  particular  set  of  Nrg  range  gates  can  be  calculated 
in  0{Nrg)  time,  that  is,  independent  of  the  values  of  G  and  Ncfav  Note  that  some  variations  of 
this  formula  and  equation  (10)  occur  at  the  boundary  areas;  refer  to  the  MATLAB  code  on  how  to 
handle  these  boundary  conditions. 

The  parameter  sets  for  the  GEAR  benchmark  are  shown  in  Table  4. 

Table  4. 


Parameter  sets  for  the  CFAR 

Kernel  1 

Benchmark. 

Name 

Description 

SetO 

Setl 

Set  2 

Set  3 

Units 

Nbm 

Number  of  beams 

16 

48 

48 

16 

beams 

Nrg 

Number  of  range  gates 

64 

3500 

1909 

9900 

range  gates 

Ndop 

Number  of  doppler  bins 

24 

128 

64 

16 

doppler  bins 

Ntgts 

Number  of  targets  that  will  be 
pseudo-randomly  distributed 
in  Radar  datacube 

30 

30 

30 

30 

targets 

Ncfar 

Number  of  CFAR  range  gates 

5 

10 

10 

20 

range  gates 

G 

CFAR  guard  cells 

4 

8 

8 

16 

range  gates 

mu 

Detection  sensitivity  factor 

100 

100 

100 

100 

w 

Workload 

0.39 

344 

94 

41 

Mflop 

11 


4.  Communication  Benchmark 


Many  signal  and  image  processing  applications  operate  on  multi-dimensional  data  in  multiple 
stages,  with  operations  focusing  on  a  different  dimension  in  each  subsequent  stage.  If  the  host 
platform  is  a  parallel  processor,  the  data  are  usually  distributed  across  the  nodes  to  exploit  data 
parallelism,  so  that  each  node  can  operate  in  parallel  on  its  portion  of  the  data  as  the  algorithm 
transitions  from  one  stage  to  the  next.  For  efficiency  reasons,  it  is  desirable  to  perform  a  corner- 
turn  of  the  data.  A  comer  turn  operation  is  defined  as  a  copy  of  the  object  with  a  change  in  the 
storage  order  of  the  underlying  data.  This  may  or  may  not  imply  transposition  of  the  computation 
object,  depending  on  the  implementation.  In  this  section,  we  describe  this  operation  in  more 
detail  and  describe  in  general  terms  an  abstracted,  high-level  application  that  requires  a  comer- 
tum  operation. 

An  application  that  requires  a  comer  turn  works  first  on  the  wvvi  of  an  input  matrix,  and  then 
on  the  columns  of  the  intermediate  result  matrix.  Mathematically,  one  of  the  most  basic  examples 
of  such  an  operation  is  a  multiplication  of  three  matrices, 

Z^B^XA,  (11) 

where  B  and  A  are  application-dependent  matrices,  X  is  the  matrix  of  input  data,  and  Z  is  the 
matrix  of  output  data.  An  example  situation  where  this  might  occur  would  be  a  filtering  operation 
followed  by  a  beamforming  operation.*  Suppose  that  we  perform  the  operations  of  Equation  (11) 
into  two  stages,  the  first  producing  Y  =  XA  and  the  second  producing  Z  =  B^Y.  Then  the  first 
stage  is  an  operation  in  which  an  entire  row  of  X  is  desired,  and  the  second  is  an  operation  in  which 
an  entire  column  of  Y  is  desired.  Thus,  the  two  stages  suggest  different  optimal  data  layouts. 

The  idea  behind  the  comer-tum  operation  is  to  preserve  data  locality  in  the  dimension  being 
operated  on.  Whether  or  not  a  mathematical  transpose  is  performed  is  implementation-dependent. 
In  multi-dimensional  arrays  in  the  C  programming  language,  the  last  array  index  is  continuous  in 
memory.  In  order  to  perform  a  comer  turn  of  a  C  language  array,  a  transpose  is  required;  that  is, 
the  order  of  some  of  the  dimensions  must  be  reversed  (see  Figure  2). 

Now  consider  an  implementation  with  the  property  that  storage  order  is  independent  of  the 
order  of  the  indexes.  In  such  an  implementation,  it  would  be  possible  to  do  a  comer  turn  without 
requiring  a  transpose  of  the  computation  object.  The  vector,  signal,  and  image  processing  library 
(VSIPL,  [10])  is  an  example  of  a  library  where  this  is  possible;  the  stride  parameters  of  a  VSIPL 
view  allow  the  VSIPL  copy  operation  to  re-arrange  the  underlying  data  without  changing  the  math¬ 
ematical  properties  (see  Figure  3).^  When  using  objects  with  this  property,  storage  order  may  be 
considered  to  be  a  mapping  issue,  whereas  when  using  standard  C  and  C++  arrays,  storage  order 
is  explicitly  embedded  in  the  application  program. 

The  discussion  above  does  not  consider  the  distribution  of  data  over  processors.  Distribution 
effectively  adds  a  level  of  memory  hierarchy  to  the  performance  of  a  comer  turn:  data  must  be 
copied  to  a  new  processor  as  well  as  re-arranged  on  the  new  processor.  Frequently,  an  all-to-all 

'a  filtering  operation  can  be  represented  as  a  matrix-matrix  multiply,  but  would  usually  not  be  implemented  in 
such  a  fashion.  Thus,  the  use  of  matrix  multiplication  here  is  an  addition^  level  of  abstraction  about  the  application. 

^It  would  also  be  possible  to  implement  standard  C/C++  data  structures  with  this  property:  use  of  VSIPL  here  is 
pureiy  a  matter  of  convenience. 
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The  C  code  to  perform  a  comer  turn  of  two-dimensional  array  A  into  two- 
dimensional  array  B  is 

//  Notice  dimensions  of  B  are  the  reverse  of  those  of  A 
int  A [NX] [NY] ,  B[Ny] [NX] ; 

for  (i  =  0;  i  <  NX;  i++) 

for  (j  =  0;  j  <  NY;  j++) 

B[j] [i]  =  A[i] [j] ; 

If  NX  and  NY  were  defined  at  compile  time  to  be  4  and  3,  respectively,  and 

■  0  1  2  ■ 

.-345 

6  7  8  ’ 

9  10  11  _ 

then  the  memory  area  underlying  A  is 
A={01234567891011}. 

Mathematically 

■  0  3  6  9  ] 

B=  1  4  7  10 

2  5  8  11  . 

and  after  execution  of  the  above  code  the  memory  area  underlying  B  is 
B={03691471025811}. 

Figure  2.  C  comer  turn  example.  In  matrix  B,  the  data  are  stored  in  comer-turned 
fashion  compared  to  matrix  A  and  B  is  the  transpose  of  A. 
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The  C  code  to  perform  a  comer  turn  from  VSIPL  matrix  view  A  into  VSIPL  matrix 
view  B  is 


//  Notice  dimensions  of  B  are  the  same  as  those  of  A: 
//  storage  order  is  different 
vsip_mview__i  *A  =  vsip_mcreate_i  (NX, 

NY, 

VSIP_ROW, 
VSIP_MEM_NONE) ; 

vsip_mview_i  *B  =  vsip_jncreate_i  (NX, 

NY, 

VSIP^COL, 
VSIP_MEM_NONE) ; 


vsip__mcopy_i_i  (A,  B)  ; 


If  NX  and  NY  were  defined  at  compile  time  to  be  4  and  3,  respectively,  and 

■  0  1  2  ’ 

^  ^  ^ 

6  7  8’ 

_  9  10  11  _ 

then  the  block  underlying  A  is 
A={01234567891011}. 


After  execution  of  the  above  code,  B  is  mathematically  the  same  as  A,  and  the 
block  area  underlying  B  is 


B={03691471025811}. 


Figure  3.  VSIPL  corner  turn  example.  Matrix  views  A  and  B  are  mathematically  the 
same  even  though  the  underlying  data  in  B  are  stored  in  comer-tum  fashion  compared  to 
matrix  A. 
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communication  operation,  in  which  every  processor  communicates  with  every  other  processor,  is 
required  as  part  of  a  distributed  comer  turn. 

Parameters  for  two  comer  turn  sizes  are  given  in  Table  5.  These  comer  turn  sizes  are  based 
on  current  applications.  For  this  particular  kernel  benchmark,  the  idea  of  timing  throughput  and 
latency  based  on  a  single  processor  is  acceptable,  but  it  would  be  preferable  to  get  a  sense  of  the 
throughput  and  latency  possible  for  a  multi-processor  comer  turn  of  the  data.  In  the  most  frequent 
occurrences  of  a  distributed  comer  turn  in  signal  processing,  the  source  and  destination  processor 
groups  are  either  identical  or  are  completely  disjoint.  The  derived  statistics  of  most  interest  for 
multi-processor  transfers  are  the  stability  of  the  throughput  over  the  number  of  processors  used, 
and  the  efficiency  of  the  transfer  versus  the  theoretical  peak  bandwidth. 


Table  5. 

Corner  turn  parameters. 


Parameter 

Name 

Description 

Values 

Setl 

Set  2 

M 

Matrix  rows 

50 

750 

N 

Matrix  columns 

5000 

5000 

W 

Workload  (Mbyte) 

2 

30 
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5.  Information  and  Knowledge  Processing  Benchmarks 


5.1  Pattern  Matching 

The  pattern  matching  kernel  entails  overlaying  two  length- JV  vectors  a  and  t  and  computing 
a  metric  that  quantifies  the  degree  to  which  these  two  vectors  match.  In  general,  the  vector  t  is 
chosen  from  a  set  of  reference  vectors  referred  to  as  the  template  library.  The  metric  used  for 
matching  is  the  weighted  MSB  (mean  square  error)  c, 

N 

Y^{wk*  (ak  -  tk)^) 

c  =  — - - ,  (12) 

J2wk 

k=l 

where  Wk,  k  =  1,2, . . . ,  N  is  the  vector  of  weights.  The  optimal  weights  for  the  feature-aided 
tracker  have  been  computed  empirically.  In  the  kernel  benchmark,  we  provide  a  generic  weighting 
vector. 

The  calculation  done  in  equation  (12)  is  performed  on  data  that  has  been  converted  to  decibels 
(the  “dB  domain”).  This  is  done  because  the  raw  power  output  from  a  signal  processing  system  can 
vary  by  many  orders  of  magnitude.  However,  conversion  of  patterns  between  the  power  domain 
and  the  dB  domain  is  performed  during  the  course  of  the  benchmark:  this  requires  the  use  of 
a  number  of  logarithm  and  exponentiation  functions.  The  operation  count  for  these  functions  is 
implementation-dependent,  and  so  the  workload  we  give  has  three  components:  a  count  of  the 
number  of  calls  to  the  exponent  function,  a  count  of  the  number  of  calls  to  the  logarithm  function, 
and  a  count  of  the  operations  in  the  rest  of  the  benchmark. 

Before  the  two  profiles  can  be  overlaid,  they  may  need  to  be  shifted  in  range  to  the  left  or  right, 
and  the  magnitude  of  the  profiles  may  need  to  be  scaled  to  match.  The  optimal  shift  and  gain  values 
can  be  found  through  brate  force  by  computing  the  MSB  for  each  combination  of  shift  and  gain 
values,  then  taking  the  minimum  MSB.  However,  by  noting  that  the  MSB  is  a  parabolic  function 
of  the  shift  and  gain,  we  can  find  the  optimum  shift  and  gain  values  at  the  global  minimum  by  first 
finding  the  optimal  shift,  then  finding  the  optimal  gain  value. 

The  parameters  of  interest  for  the  pattern  matching  benchmark  are  the  length  of  the  pattern 
vectors,  die  size  of  the  template  pattern  library,  and  the  number  of  shift  and  scale  operations  per¬ 
formed.  These  parameters  are  given  for  two  data  set  sizes  in  Table  6. 

5.2  Database  Operations 

We  consider  the  database  operations  in  the  context  of  a  tracking  application  that  stores  track 
records  in  a  database.  Bach  record  consists  minimally  of  coordinates,  a  time  value  known  as  a 
cycle,  and  some  associated  target  data.  During  each  cycle,  the  tracker  receives  a  set  of  target 
reports  from  a  radar.  For  each  target  report,  the  tracker  will  search  the  database  for  all  track  records 
that  could  be  associated  with  that  target  report,  based  on  their  location.  Target  reports  that  are  not 
associated  with  any  current  tracks  will  be  inserted  into  the  database  and  assigned  the  current  time 
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Table  6. 


Pattern  matching  parameters. 


Parameter 

Name 

Description 

Values 

Setl 

Set  2 

N 

Pattern  length 

64 

128 

K 

Number  of  patterns 

72 

256 

Sr 

Number  of  shifts 

22 

43 

Sm 

Number  of  magnitude  scalings 

13 

13 

Wi 

Workload:  log^o  function  calls 

W2 

Workload:  10®  function  calls 

4.61  X  10^ 

W3 

Workload:  Other  floating-point  ops  (flops) 

1.05  X  10® 

12.3  X  10® 

value.  The  oldest  records  (those  that  have  not  been  associated  with  a  target  report  after  a  specific 
number  of  cycles)  will  be  deleted. 

Our  goal  is  to  measure  the  performance  of  the  search,  insert,  and  delete  operations,  without  ever 
altering  the  contents  of  any  particular  record.  The  major  motivation  for  this  is  to  avoid  generating 
the  large  amount  of  data  necessary  for  the  database.  Thus,  we  do  not  actually  generate  and  maintain 
the  contents  of  the  database  itself,  only  the  indexing  structures.  Therefore,  the  structures  we  will 
use  will  only  store  the  values  we  need:  x  and  y  coordinates,  a  time  value  and  a  record  identifier, 
which  is  an  integer  index  of  or  pointer  to  a  data  record. 

For  an  application  of  this  type,  the  three  database  operations  that  are  needed  are: 

search  Look  up  and  retrieve  all  items  whose  characteristics  fall  into  a  given  range.  In  this  case, 
a  search  is  done  for  all  targets  within  a  specified  range  of  a  particular  (re,  y)  coordinate  pair, 
where  x  and  y  are  floating-point  numbers.  Given  a  set  of  sector  bounds  {xMin,  xmox,  yMin,  yMax}> 
this  search  can  be  expressed  in  a  fashion  approximating  the  structured  query  language  (SQL) 
as 

select  *  from  TrackDatabase 

where  (x  >  XMin  AND  x  <  xmox  AND  y  >  yMin  AND  y  <  yuax)- 

insert  Add  a  new  item  to  the  database.  This  can  be  expressed  in  a  SQL-like  fashion  as 
insert  into  TrackDatabase  values(id,  Xu,  2/u.  timey). 

delete  Remove  an  item  from  the  database,  expressed  in  a  SQL-like  fashion  as 
delete  from  TrackDatabase  where  {time  >  timef). 

Database  workloads  are  provided  based  on  two  scenarios,  a  kinematic  tracking  scenario  similar 
to  the  parameters  proposed  for  the  integrated  radar-tracker  benchmark,  and  a  multi-hypothesis 
tracking  scenario  in  which  the  database  is  allowed  to  become  much  larger.  For  each  scenario,  the 
frequency  of  each  operation  (search,  insert,  delete)  is  specified.  The  parameters  that  define  these 
two  scenarios  are  given  in  Table  8. 
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Table  7. 

Track  record  entries. 


Name 

Explanation 

Type 

snr 

relative  strength  of  target  detection. 

float 

X 

estimated  x-coordinate  of  target  at  time  t 

float 

y 

estimated  y-coordinate  of  target  at  time  t. 

float 

X 

Estimated  velocity  of  target  in  the  x  direction  at  time  t 

float 

y 

Estimated  velocity  of  target  in  the  y  direction  at  time  t. 

float 

t 

Time  stamp  (an  integer  cycle  number)  for  the  last  target  re¬ 
port  associated  with  this  track. 

integer 

status 

Enumeration:  one  of  {  New,  Novice,  Established  }. 

Enumeration 

Classification  vector  from  overall  FAT  computations. 

60-element 
float  vector 

Estimated  aspect  angle  of  target  at  time  t. 

float 

hrr 

High  Range  Resolution  (HRR)  profile  for  last  target  report 
associated  with  this  track. 

64-element 
float  vector 

Pp 

Extrapolated  process  noise  covariance  matrix. 

4x4  double 
matrix 

Hypothesis 

Movement  model  hypothesized  for  last  track/report  associa¬ 
tion  for  this  track  (valid  models  are  linear  constant  velocity, 
linear  accelerating,  arc  of  circle  constant  speed,  unmodelled) 

Enumeration 

Table  8. 


Tracking  parameters. 


Parameter 

Values 

Description 

Setl 

Set  2 

Cycles  to  run 

100 

100 

Total  records  generated 

500 

102,400 

Initial  number  of  placed  targets 

450 

92,160 

Number  of  grid  rows,  M 

5 

32 

Number  of  grid  columns,  N 

5 

32 

Grid  row  search  size.  Ax 

2 

2 

Grid  column  search  size.  Ay 

2 

2 

Grid  target  density,  d 

20 

100 

Idle  cycles  before  deletion,  r 

10 

10 

Search  operations  per  cycle,  ns 

400 

100 

Matches  found  per  search  k 

80 

400 

Insert  operations  per  cycle,  ni 

20 

300 

Delete  operations  per  cycle  nd 

20 

300 

Workload  per  cycle  (transactions) 

440 

700 
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5.2.1  Test  Data 


The  test  data  for  the  database  can  be  thought  of  as  a  set  of  fixed-location  “targets”  that  appear 
and  disappear  periodically,  distributed  on  a  grid  of  size  M  x  N  km^.  We  specify  the  require¬ 
ment  that,  on  average,  there  should  be  k  matches  returned  for  each  search.  This  implies  that  in  a 
search  area  of  size  Ax  x  Ay  km^,  there  are  an  average  of  d  =  k/ {Ax Ay)  targets  per  square  km. 
Track  records  are  assumed  to  be  candidates  for  deletion  every  r  cycles.  Therefore,  the  initial  data 
generation  process  consists  of  generating  an  equal  number  of  “targets”  with  a  given  time  value 
between  [0,  r  —  1],  i.e.  r  time  cycles  must  elapse  before  any  record  with  a  certain  timestamp  can 
be  deleted.  We  then  distribute  the  targets  equally  in  a  uniform  pattern  in  each  of  MN  1-km^  grid 
portions,  placing  on  average  of  d  targets  in  each  of  the  squares.  TTie  initial  database  consists  of  all 
the  initially  placed  targets,  i.e.  those  that  do  not  have  a  t  value  of  zero. 

During  each  cycle,  the  generator  must  generate,  based  on  what  targets  are  placed  on  the  grid, 
search,  insert  and  delete  operations  to  be  performed  in  that  particular  cycle.  Each  search  command 
specifies  x  and  y  coordinates  chosen  from  a  uniform  random  distribution:  the  result  of  the  search  is 
all  the  targets  located  within  a  square  of  size  Ax  x  Ay  centered  on  those  coordinates.  Each  insert 
command  specifies  a  set  of  targets  to  add,  randomly  chosen  from  those  targets  not  currently  in  the 
database.  Similarly,  each  delete  command  specifies  a  set  of  targets  to  remove,  randomly  chosen 
from  those  targets  in  the  database. 

As  mentioned,  no  updates  are  needed  to  the  underlying  data.  Therefore,  for  a  particular  cycle 
c,  we  define  Cm  =  c  mod  r  (r  being  the  number  of  cycles  that  must  elapse  before  deletion).  Any 
inserted  targets  on  cycle  c  are  given  time  value  Cm,  and  the  targets  to  delete  are  chosen  from  those 
previously  inserted  targets  with  a  time  value  (c  •+■  1)  mod  r. 

To  generate  valid  insert  and  delete  commands,  the  generator  will  actually  have  to  keep  track 
of  which  targets  are  on  the  field  and  which  are  not.  This  is  accomplished  by  using  data  structures 
similar  to  the  ones  used  in  the  actual  test.  A  red-black  tree  is  used  to  keep  track  of  what  targets 
are  placed  and  which  ones  are  not,  and  a  linked  list  keeps  track  of  the  time  values  of  the  placed 
targets.  Using  this  information,  commands  can  be  generated  that  insert  unplaced  targets  or  delete 
previously  placed  targets.  The  command  generation  occurs  before  any  of  the  actual  benchmarking 
and  therefore  is  not  timed. 


5.2.2  Implementation 

Based  on  the  described  search  requirements,  we  maintain  two  structures,  one  which  allows 
searching  on  the  x  and  y  values,  and  one  that  allows  searching  on  the  time  stamp.  Each  of  these 
structures  stores  some  form  of  a  pointer  to  the  full  track  records,  which  we  will  randomly  fill  with 
data.  In  order  to  make  the  benchmark  more  representative  of  actual  database  operation,  the  data 
will  then  be  accessed  and  copied  in  memory  via  memcpy. 

A  red-black  tree  [2]  is  the  chosen  structure  for  allowing  the  x  and  y  search.  Since  the  fields  x 
and  y  are  always  searched  together,  the  x  values  are  indexed  using  a  red-black  tree  with  the  y  values 
stored  at  each  node  for  quick  reference.  The  red-black  tree  is  chosen  because  the  search,  traverse, 
insert,  and  delete  operations  are  all  asymptotically  0(log2n)  time.  To  search  for  all  targets  that 
have  X  values  in  a  given  range,  the  program  first  finds  the  smallest  match  not  less  than  the  minimum 
range  value,  then  finds  the  largest  match  not  greater  than  the  maximum  range  value.  The  program 
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then  traverses  the  tree  between  these  two  values,  comparing  the  y  values  at  each  node,  to  find  those 
that  are  in  the  specified  range.  The  asymptotic  performance  of  the  search  operation  is  then 

W,  =  {2  +  V)\og^N^,  (13) 

where  V  is  the  number  of  leaves  between  the  minimum  and  maximum  values  and  is  the  size 
of  the  database.  The  first  factor  of  2  in  (13)  is  from  the  two  searches  and  the  factor  of  V  comes 
from  the  traversal. 

The  chosen  implementation  for  searching  on  the  time  stamp  is  a  linked  list.  Each  list  is  kept 
sorted  by  the  time  stamp.  New  tracks  are  added  at  the  end  of  the  list,  which  takes  0(1)  time. 
Removing  random  tracks  unfortunately  takes  0(n)  time;  however,  because  we  always  remove  the 
oldest  records  firom  the  front  and  keep  the  list  in  sorted  order,  we  need  only  search  the  first  ^  of  the 
list,  where  r  is  the  number  of  cycles  that  must  elapse  before  records  are  deleted. 

For  a  workload  value  for  each  scenario,  we  do  not  use  the  asymptotic  values  above,  but  instead 
count  each  transaction  (search,  insert,  delete)  as  an  operation  to  be  performed.  This  workload 
value,  given  in  Table  8,  can  be  used  to  compute  throughput  for  the  database  kernel  (in  transactions 
per  second)  and  compare  among  different  architectures.  However,  an  efficiency  for  the  database 
kernel  benchmark  is  not  defined.  Peak  performance  for  the  database  kernel  benchmark  would  be 
calculated  from  the  rate  at  which  the  PCA  chip  can  perform  each  database  operation,  which  in  turn 
is  related  to  the  memory  hierarchy. 


5.2.3  Pseudocode 

Let  UT  and  FT  be  the  sets  of  unplaced  and  placed  targets  (respectively),  nt  be  the  total  number 
of  cycles  to  run  this  benchmark,  and  ns,  ni  and  nd  be  the  number  of  select,  insert  and  delete 
operations  to  run  each  cycle  (respectively).  The  timing  sequence  can  be  described  in  pseudocode 
as  shown  in  Figure  4. 


5.3  Graph  Optimization  via  Genetic  Algorithm 

Genetic  algorithms  [4,  6,  1 1]  have  become  a  viable  solution  to  strategically  perform  a  global 
search  by  means  of  many  local  searches.  The  basis  of  the  genetic  algorithm  methods  is  derived 
from  the  mechanisms  of  evolution  and  natural  genetics.  The  genetic  algorithm  that  is  being  used 
as  one  of  these  kernel  benchmarks  is  a  fairly  simple  version.  Many  modifications  are  possible  that 
can  enhance  the  performance  for  a  given  application,  and  some  small  enhancements  have  been 
made  to  enhance  the  performance  of  this  benchmark. 

A  genetic  algorithm  works  by  building  a  population  of  chromosomes  which  is  a  set  of  possible 
solutions  to  the  optimization  problem.  Within  a  generation  of  a  population,  the  chromosomes 
are  randomly  altered  in  hopes  of  creating  new  chromosomes  that  have  better  evaluation  scores. 
The  next  generation  population  of  chromosomes  is  randomly  selected  from  the  current  generation 
with  selection  probability  based  on  the  evaluation  score  of  each  chromosome.  The  simple  genetic 
algorithm  follows  the  structure  depicted  in  Figure  5.  Each  of  these  operations  will  be  described  in 
the  following  subsections. 
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/  Generate  ns  select(x,  y)  instructions,  where  x,y  U (0,  M)^ 

I  <—  Generate  ni  insert(i)  instructions,  where  t  G  UT  is  chosen  with  equal  likelihood. 
I  <—  Generate  nd  delete(t)  instructions,  where  t  G  FT  is  chosen  with  equal  likelihood. 
{Start  timer} 
for  all  i  G  /  do 
if  i  ~  select(a:,  y)  then 
S  *—  select(a;,  y) 
for  all  s  G  5  do 
bu  f  •«—  memcpy  ( s ) 
end  for 

else  if  i  insert(t)  then 
FT  insert(i) 

UT  — >  remove(t) 
else  if  i  ~  delete(t)  then 
FT  — >  removed) 

UT  <—  insert(0 
end  if 
end  for 

{Return  timer  value} 

‘\U{a,  b)  denotes  a  uniform  distribution  from  a  to  b. 

Figure  4.  Pseudo-code  for  database  timing  loop. 


Simple  Genetic  Algorithm  ( ) 

{ 

Initialization; 

Evaluation; 

while  termination  criterion  has  not  been  reached 

{ 

Selection^ndJReproduction; 

Crossover; 

Mutation; 

Evaluation; 

} 

} 


Figure  5.  Structure  of  a  simple  genetic  algorithm. 
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5.3.1  Initialization 


Initialization  involves  setting  the  parameters  for  the  algorithm,  creating  the  scores  for  the  sim¬ 
ulation,  and  creating  the  first  generation  of  chromosomes.  In  this  benchmark,  seven  parameters  are 
set: 

•  the  genes  value  (Genes)  is  the  number  of  variable  slots  on  a  chromosome; 

•  the  codes  value  (Codes)  is  the  number  of  possible  values  for  each  gene; 

•  the  population  size  (PopSize)  is  the  number  of  chromosomes  in  each  generation; 

•  crossover  probability  (Cross  over  Prob)  is  the  probability  that  a  pair  of  chromosomes  will 
be  crossed; 

•  mutation  probability  (MutationProb)  is  the  probability  that  a  gene  on  a  chromosome  will 
be  mutated  randomly; 

•  the  maximum  number  of  generations  (MaxGenerations)  is  a  termination  criterion  which 
sets  the  maximum  number  of  chromosome  populations  that  will  be  generated  before  the  top 
scoring  chromosome  will  be  returned  as  the  search  answer;  and 

•  the  generations  with  no  change  in  highest-scoring  (elite)  chromosome  (GensNoChange)  is 
the  second  termination  criterion  which  is  the  number  of  generations  that  may  pass  with  no 
change  in  the  elite  chromosome  before  that  elite  chromosome  will  be  returned  as  the  search 
answer. 

The  scores  matrix  for  the  simulation,  which  is  generated  in  the  GenAlgGen  script,  is  the  set 
of  scores  for  which  the  best  solution  is  to  be  found.  The  attempted  optimization  is  to  find  the  code 
for  each  gene  in  the  solution  chromosome  that  maximizes  the  average  score  for  the  chromosome. 
Finally,  the  first  generation  of  chromosomes  are  generated  randomly. 

5.3.2  Evaluation 

Each  of  the  chromosomes  in  a  generation  must  be  evaluated  for  the  selection  process.  This  is 
accomplished  by  looking  up  the  score  of  each  gene  in  the  chromosome,  adding  the  scores  up,  and 
averaging  the  score  for  the  chromosome.  As  part  of  the  evaluation  process,  the  elite  chromosome 
of  the  generation  is  determined. 

5.3.3  Selection  and  Reproduction 

Chromosomes  for  the  next  generation  are  selected  using  the  roulette  wheel  selection  scheme  [11] 
to  implement  proportionate  random  selection.  Each  chromosome  has  a  probability  of  being  cho¬ 
sen  equal  to  its  score  divided  by  the  sum  of  the  scores  of  all  of  the  generation’s  chromosomes. 
In  order  to  avoid  losing  ground  in  finding  the  highest-scoring  chromosome,  elitism  [1 1]  has  been 
implemented  in  this  benchmark.  Elitism  reserves  two  slots  in  the  next  generation  for  the  highest 
scoring  chromosome  of  the  current  generation,  without  allowing  that  chromosome  to  be  crossed 
over  in  the  next  generation.  In  one  of  those  slots,  the  elite  chromosome  will  also  not  be  subject  to 
mutation  in  the  next  generation. 
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5.3.4  Crossover 


In  the  crossover  phase,  all  of  the  chromosomes  (except  for  the  elite  chromosome)  are  paired  up, 
and  with  a  probability  CrossoverProb,  they  are  crossed  over.  The  crossover  is  accomplished 
by  randomly  choosing  a  site  along  the  length  of  the  chromosome,  and  exchanging  the  genes  of  the 
two  chromosomes  for  each  gene  past  this  crossover  site. 

5.3.5  Mutation 

After  the  crossover,  for  each  of  the  genes  of  the  chromosomes  (except  for  the  elite  chromo¬ 
some),  the  gene  will  be  mutated  to  any  one  of  the  codes  with  a  probability  of  MutationProb. 
With  the  crossover  and  mutations  completed,  the  chromosomes  are  once  again  evaluated  for  an¬ 
other  round  of  selection  and  reproduction. 

5.3.6  Termination 

The  loop  of  chromosome  generations  is  terminated  when  certain  conditions  are  met.  When  the 
termination  criteria  are  met,  the  elite  chromosome  is  returned  as  the  best  solution  found  so  far.  For 
this  benchmark,  there  are  two  criteria:  if  the  number  of  generation  has  reached  a  maximum  number, 
MaxGenerations,  or  if  the  elite  solution  has  not  changed  for  a  certain  number  of  generations, 
GensNoChange. 

5.3.7  Implementation  Notes 

Genetic  algorithms  are  being  used  around  the  world  for  an  enormous  variety  of  applications. 
However,  this  benchmark  of  genetic  algorithms  has  been  designed  with  two  specific  purposes  in 
mind:  matching  computational  tasks  with  processing  units  in  a  general  independent  task  envi¬ 
ronment  and  in  signal  processing  pipeline  tasks.  For  the  general  independent  task  environment, 
the  genes  of  the  chromosomes  are  the  computational  tasks  which  have  arrived  in  the  order  of  their 
gene’s  number,  and  the  codes  are  the  possible  processing  units  upon  which  the  computational  tasks 
can  be  executed.  Similarly,  for  the  signal  processing  pipeline  tasks,  the  genes  of  the  chromosomes 
are  the  pipelined  tasks  that  constitute  a  signal  processing  chain,  and  the  codes  are  the  computa¬ 
tional  unit  mappings  upon  which  the  pipeline  stage  tasks  can  be  run.  The  scores  for  each  code  in 
a  given  gene  position  represents  a  goodness  factor  (for  computational  efficiency,  execution  time, 
or  some  other  measure)  ranging  from  zero  to  one  (one  being  best).  The  goal  then  is  to  find  the 
mapping  of  tasks  onto  processors  that  yields  the  best  score,  in  this  case  a  perfect  score  of  one. 

Random  numbers  for  the  benchmark  C  code  are  generated  using  the  random  number  generator 
from  VSIPL  [10,  p.245].  This  random  number  generator  is  used  so  that  the  code  is  portable  and 
the  workload  is  easily  calculable  (based  on  the  discussion  in  the  standard,  we  assume  1 1  ops  per 
random  number  generated).  Use  of  this  random  number  generator  is  not  required.  However,  if 
a  different  random  number  generator  is  used,  the  workload  given  in  Table  9  should  be  altered 
accordingly. 

For  this  kernel  benchmark,  four  parameter  sets  have  been  included.  These  sets  are  shown  in 
Table  9.  Sets  1  and  2  are  sample  parameters  for  the  general  independent  task  scenario  while  sets 
3  and  4  are  sample  parameters  for  the  digital  signal  processing  pipeline  task  scenario.  For  this 
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Table  9. 

Parameter  sets  for  the  Genetic  Algorithm  Kernel  Benchmark. 


Name 

Description 

Setl 

Set  2 

Set  3 

Set  4 

Units 

Codes 

4 

8 

100 

1000 

codes 

Genes 

8 

96 

5 

10 

genes 

PopSize 

Number  of  chromosomes  in  a 
generation 

50 

200 

100 

400 

chromosomes 

CrossoverProb 

Probability  of  crossing  over  a 
pair  of  chromosomes 

0.01 

0.002 

0.02 

0.03 

MutationProb 

Probability  of  mutating  a 
chromosome 

0.60 

0.60 

0.60 

0.30 

MaxGenerations 

Maximum  number  of  genera¬ 
tions 

500 

2000 

500 

5000 

generations 

GensNoChange 

Maximum  number  of  genera¬ 
tion  with  no  change  in  elite 
chromosome 

50 

150 

50 

500 

generations 

Ops  per  generation 

1750 

77400 

Random  numbers  per  genera¬ 
tion 

898 

38798 

1198 

8798 

numbers 

Ops  for  random  number  gen. 

9878 

426778 

13178 

96778 

operations 

Total  Ops 

11628 

504178 

15478 

113978 

operations 

kernel,  we  define  the  workload  in  operations  (rather  than  floating-point  operations)  per  generation. 
The  workload  is  related  to  the  number  of  genes  and  the  population  size  per  generation. 

Parts  of  the  genetic  algorithm  code  are  embarrassingly  parallel,  including  the  crossover  and 
mutation  sections.  Most  of  the  evaluation  section  is  also  embarrassingly  parallel,  except  for  the 
elite  chromosome  determination  portion.  However,  tiie  selection  and  reproduction  section  can¬ 
not  be  conducted  in  a  parallel  manner  since  a  view  of  the  entire  population  is  necessary.  More 
discussion  on  parallelizing  and  distributing  genetic  algorithms  can  be  found  in  [5]. 
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6.  Further  Kernel  Benchmarks 


These  are  benchmarks  that  might  be  considered  for  a  second  set  of  kernel  benchmarks,  but  are  not 
included  with  this  first  set. 

Image  encoding.  Compress  a  synthetic  aperture  radar  (SAR)  image  for  storage  or  transmission. 
The  dynamic  range  of  SAR  imagery  is  such  that  it  is  not  well  represented  using  conventional 
image  processing  standards  such  as  that  defined  by  the  joint  photographic  experts  group  (JPEG) 
for  image  compression.  The  JPEG  2000  standard  addresses  the  problems  that  JPEG  presents  for 
SAR  images.  For  more  details  on  JPEG  2000,  see  Taubman  and  Marcellin  [12].  Baxter  and  Seibert 
have  performed  an  analysis  of  the  desirable  features  of  an  encoding  algorithm  for  SAR  images  [1]. 
The  major  features  of  the  encoding  algorithm  as  they  describe  it  are: 

•  wavelet  packet  transforms  with  a  Gabor-like  tree  structure  and  smooth  biorthogonal  wavelet 
filters, 

•  trellis-coded  quantization,  and 

•  a  bit-allocation  procedure  based  on  minimizing  distortion  (perceptual  distortion,  based  on 
the  human  visual  system)  and  rate. 

Some  of  these  features  (particularly  trellis-coded  quantization)  are  in  “part  2”  of  the  JPEG  2000 
standard,  and  are  therefore  not  yet  present  in  most  publicly  available  implementations  of  the  stan¬ 
dard. 

Secure  network  protocol.  Transmit  a  message  of  a  given  size  with  authentication  and  encryption, 
based  on  IPSec  or  a  similar  protocol. 

Synchronization.  Mark  a  value  on  a  different  processor  or  in  a  remote  memory  for  exclusive 
access  (lock/unlock  operations). 

Image  processing:  morphological  operations.  Take  an  image  and  perform  an  “opening”  or 
“closing”  operation  on  it.  These  are  integer  convolution  operations. 

Image  processing:  edge  detection.  Perform  edge  detection  in  both  the  x  and  y  dimensions  of  a 
two-dimensional  image. 

Giga-updates  per  second.  Read,  then  write  a  sequence  of  random  memory  locations  in  a  large 
memory  space.  A  description  of  the  benchmark  is  at 
<http://iram.cs.berkeley.edu/~brg/dis/gups/>. 

Incomplete  Gamma  function.  Calculate  values  of  the  incomplete  gamma  function. 
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APPENDIX  A 
Code  Conventions 


Where  MATLAB  code  for  a  kernel  benchmark  is  provided,  at  least  three  MATLAB  executable 
files  are  included.  The  first  is  the  actual  benchmark  executable  itself:  this  is  typically  a  MATLAB 
function.  A  second  file,  a  script  file,  is  provided  to  test  the  function.  This  file  typically  tests  for  the 
presence  of  a  data  file.  If  the  data  file  is  not  present,  a  further  MATLAB  function  (contained  in  the 
third  file)  is  called  to  generate  the  data;  otherwise,  the  data  is  read  from  the  file. 

For  a  particular  kernel,  where  the  actual  benchmark  function  is  named  kernel  .m,  the  test 
script  is  named  kernelTest  .m  and  the  data  generator  is  named  kernelGen  .m.  For  exam¬ 
ple,  for  the  FIR  filter  bank  kernel,  the  function  is  named  FIRBank.m,  the  test  script  is  named 
FIRBankTes  t .  m,  and  the  data  generator  is  named  FIRBankGen .  m. 

The  base  names  for  each  of  the  kernels  are  listed  in  Table  10.  Note  that  no  code  or  data  is 
provided  for  the  database  kernel.  Example  C  code,  but  no  data,  is  provided  for  the  comer-tum 
kernel. 


Table  10. 

Kernel  benchmark  code  names. 


Kernel  name 

Base  file  name 

FIR  filter  bank 

FIRBank 

Singular  value  decomposition 

SVD 

CFAR  detection 

Cfar 

Pattern  match 

PatternMatch 

Genetic  algorithm 

GenAlg 
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APPENDIX  B 
Revisions 

The  original  version  of  this  document  was  issued  July  31,  2002.  This  is  the  first  revision  of  the 
document.  It  introduces  the  following  changes: 

•  an  error  in  the  GEAR  workload  table  (Table  4)  was  corrected; 

•  a  description  of  the  reduced  SVD  algorithm,  including  workload  estimates,  was  added; 

•  the  description  of  the  database  kernel  benchmark  was  heavily  revised  and  the  workload  was 
changed; 

•  a  description  of  the  random  number  generator  used  for  the  genetic  algorithm  was  added  and 
its  workload  was  updated  in  a  corresponding  way; 

•  the  pattern  match  benchmark  description  and  workload  were  updated  to  reflect  a  more  effi¬ 
cient  implementation  and  to  introduce  a  second  data  set;  and 

•  the  discussion  of  metrics  (in  particular,  of  efficiency  and  program  stability)  was  updated  to 
reflect  that  efficiency  is  not  always  easily  calculable  for  the  different  kernels. 
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